Here we present the code used to calculate the vectors of RF and KC distances in two different ways:
gene (CDS+introns, nucleotide alignment),
exon (CDS, nucleotide alignment),
codon (CDS, codon alignment), and
amino acid (CDS, alignment of translated nucleotides)
The phylogenetic analysis were performed with these four matrices under Maximum likelihood (IQ-TREE) with different schemes of data partition (unpartitioned, partitioned by gene, and best scheme of partition) and the Multispecies Coalescent (MSC) method using a summary method (ASTRAL-II) and a complete search (SVDquartets).
Along this tutorial, we use the following acronyms: gene - ge exon - ex codon - cd amino acid - aa IQ-TREE - iq ASTRAL-II - as SVDQuartets - sv concatenated - c gene partition - gp best schme of partition - bs
First, upload the following packages:
Treespace package requires all trees to have the same terminals. In other words, if you have species (i.e., terminals) that lost one or a few genes, you will have to remove those species or those gene from the analysis. Here we removed three species that had a high number of missing data and 16 genes.
Tip: We used sumtrees.py to suppress the annotations and to root the trees. We need root the trees now to use it in KC. However, we will have to unroot them for RF part.
#First, using sumtrees.py, remove all the annotations of the tree
#for file in *.tre; do sumtrees.py -F newick --suppress-annotations -p -d 0 --root-target-at-outgroup "Apiales_Araliaceae_Aralia_undulata" -o ../output_nwk/${file//.tre/.nwk} $file; done
After we suppressed the annotations, we preparee a nexus file with all the trees and labelled them accordingly.
Setting the working directory before start.
setwd("/Users/deisejpg/Documents/1_Plastid_genome/phyloanalysis/treespace/update_feb2018")
getwd()
## [1] "/Users/deisejpg/Documents/1_Plastid_genome/phyloanalysis/treespace/update_feb2018"
Next, we removed three species from the dataset so all the trees have the same terminals. Here we also rooted the trees, needed for KC metric.
#Prepare a nexus file with the trees that you want to drop the tips
#trees_droptip <- read.nexus("./input_prep/dropping_tips.nex")
#As I have multiple trees in a single nexus file, I am using lapply in addition to drop.tip, so the three species indicated are removed from all trees
#trees_droptip <- lapply(trees_droptip, drop.tip, c("Caryophyllales_Caryophyllaceae_Silene_capitata", "Myrtales_Myrtaceae_Xanthostemon_chrysanthus", "Fabales_Polygalaceae_Polygala_alba"))
#Then I verify the class of my tree object and I assign it as a "multiphylo" object, meaning that it has multiple trees
#class(trees_droptip)
#class(trees_droptip) <- "multiPhylo"
#class(trees_droptip)
#Saving a file with rooted trees without the three terminals that we dropped
#write.tree(trees_droptip, file="./input_prep/tipsdropped_rooted.nwk", d=0)
#To make sure my trees are unrooted (necessary for RF) I use unroot() and then I save the new trees in a newick file
#trees_droptip <- unroot(trees_droptip)
#write.tree(trees_droptip, file="./input_prep/tipsdropped_unrooted.nwk", d=0)
Then we opened the file, and prepared a new nexus file with all species trees from all of our analysis. We labelled the trees with information about method, matrix, and partitioned scheme (for IQ-TREE).
Distance between species trees inferred using MSC and ML (concatenated and partitioned by gene and by best scheme):
#reading the nexus file with all the trees and confirming they are unrooted
tree_allspp_nrtd <- read.nexus("./input_unrooted/allsptrees_tipsdropped_unrooted.nex")
tree_allspp_nrtd
## 19 phylogenetic trees
#class(tree_allspp_nrtd)
is.rooted.multiPhylo(tree_allspp_nrtd)
## as_ge as_ex as_cd as_aa sv_ge sv_ex sv_cd iq_ge_c
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## iq_ex_c iq_cd_c iq_aa_c iq_ge_gp iq_ex_gp iq_cd_gp iq_aa_gp iq_ge_bs
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## iq_ex_bs iq_cd_bs iq_aa_bs
## FALSE FALSE FALSE
Getting the names of the trees:
tree_names_allspp_nrtd <- names(tree_allspp_nrtd)
tree_names_allspp_nrtd
## [1] "as_ge" "as_ex" "as_cd" "as_aa" "sv_ge" "sv_ex"
## [7] "sv_cd" "iq_ge_c" "iq_ex_c" "iq_cd_c" "iq_aa_c" "iq_ge_gp"
## [13] "iq_ex_gp" "iq_cd_gp" "iq_aa_gp" "iq_ge_bs" "iq_ex_bs" "iq_cd_bs"
## [19] "iq_aa_bs"
Now we calculate the distances between the trees using “RF” for unweighted Robinson-Foulds distance:
treespace_allspp_nrtd <- treespace(tree_allspp_nrtd, method = "RF", nf = 18)
## Warning in is.euclid(distmat): Zero distance(s)
#If you see "Zero distance(s)" you can check below and look for the tree(s) that are exactly the same
#head(treespace_allspp_nrtd)
#Here, iq_ex_c distance fom iq_ex_bs is 0, meaning that these two trees have no differences
#Now you can see the values of the PCoA axis
#head(treespace_allspp_nrtd$pco$li)
We use the round() and sum() to verify the percentage explained by each axis in the ordination:
#Eigen values are measurements of variance, so to calculate the percentage explained by each axis, just divide the eigen values by the sum of all eigen values
#Number three indicates three digits
round(treespace_allspp_nrtd$pco$eig/sum(treespace_allspp_nrtd$pco$eig),3)
## [1] 0.345 0.202 0.115 0.083 0.069 0.054 0.053 0.029 0.014 0.013 0.012
## [12] 0.005 0.003 0.002 0.001
#Axis1 = 0.345, Axis2 = 0.202, Axis3 = 0.115; Axis4 = 0.083, Axis5 = 0.069
Then we find the clusters or trees using unweighted Robinson-Foulds:
#Since this step is interactive, I used the lines below to determine that the number of clusters is 2
#id_cluster_nrtd <- findGroves(treespace_allspp_nrtd, method = "RF")
#"Please define a cutoff point:""
#40
#y
#Now I run the same code defining "nclust = 2"
id_cluster_nrtd <- findGroves(treespace_allspp_nrtd, method = "RF", nclust = 2)
Then we plot again to check:
plotGrovesD3(id_cluster_nrtd, treeNames = tree_names_allspp_nrtd)
#head(id_cluster_nrtd)
Now we include the information about the software utilized in the phylogenetic inference:
methods_as_sv_iq <- read.csv("allsptrees.csv", h=F)
class(methods_as_sv_iq)
## [1] "data.frame"
#head(methods_as_sv_iq)
We then gathered vectors with usefull information and added in a new dataframe which includes: the axis of the ordination calculated by treespace(), cluster information about the methods, and the names of the trees, and groups with groups that clustered, calculated with findGroves().
shorter_names <- c("ge", "ex", "cd", "aa", "ge", "ex", "cd", "ge_c", "ex_c", "cd_c", "aa_c", "ge_gp", "ex_gp", "cd_gp", "aa_gp", "ge_bs", "ex_bs", "cd_bs", "aa_bs")
df_nrtd <- data.frame(id_cluster_nrtd$treespace$pco$li, dataset=tree_names_allspp_nrtd,
Method=methods_as_sv_iq$V1, group=id_cluster_nrtd$groups,
name=rownames(id_cluster_nrtd$treespace$pco$li), Shorter_names=shorter_names)
#head(df_nrtd)
Now we included a last column with information whether the tree was inferred using MCM or ML method:
df_nrtd["coalesc_x_concat"]<- c("mcm", "mcm", "mcm", "mcm", "mcm", "mcm", "mcm", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree")
class(df_nrtd$coalesc_x_concat)
## [1] "character"
#head(df_nrtd)
Finally, we gathered the level of the three software utilized and reordered the “coalesc_x_concat” column so we have data from astral and svdq in sequence.
df_nrtd$Method <- factor(df_nrtd$Method, levels = as.ordered(levels(df_nrtd$Method)))
df_nrtd_arranged <- arrange(df_nrtd, desc(coalesc_x_concat))
#head(df_nrtd_arranged)
Now we plot the Robinson-Foulds graph.
#This is to put in order: "astral, svdq, iqtree"
df_nrtd$Method <- factor(df_nrtd$Method, levels(df_nrtd$Method)[c(1,3,2)])
#Now ploting the species tree landscape
plot1_RF <- ggplot(df_nrtd, aes(x=A1, y=A2, size=3, label=Shorter_names, color=Method, shape=Method)) +
geom_point(alpha=1) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=1.5, point.padding = 0.8, show.legend=FALSE) +
labs(x="PCoA1 (34.5%)", y=" PCoA2 (20.2%)") +
theme_classic() + ggtitle("Robinson-Foulds") +
theme(legend.text=element_text(size=20),legend.title=element_text(size=20), plot.title = element_text(size=20, hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 18,face="bold"), axis.text=element_text(size=14,face="bold")) +
scale_colour_manual(values = c("#000000", "#E69F00", "#56B4E9"))+
scale_shape_manual(values=c(17, 17, 19))
plot1_RF
Distance between species trees inferred using MCM and ML (concatenated and partitioned by gene and by best scheme):
trees_allspp_rtd <- read.nexus("./input_rooted/allsptrees_tipsdropped_rooted.nex")
trees_allspp_rtd
## 19 phylogenetic trees
trees_allspp_rtd <- root(trees_allspp_rtd, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
class(trees_allspp_rtd)
## [1] "multiPhylo"
is.rooted.multiPhylo(trees_allspp_rtd)
## as_ge as_ex as_cd as_aa sv_ge sv_ex sv_cd iq_ge_c
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## iq_ex_c iq_cd_c iq_aa_c iq_ge_gp iq_ex_gp iq_cd_gp iq_aa_gp iq_ge_bs
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## iq_ex_bs iq_cd_bs iq_aa_bs
## TRUE TRUE TRUE
trees_allspp_rtd_names <- names(trees_allspp_rtd)
trees_allspp_rtd_names
## [1] "as_ge" "as_ex" "as_cd" "as_aa" "sv_ge" "sv_ex"
## [7] "sv_cd" "iq_ge_c" "iq_ex_c" "iq_cd_c" "iq_aa_c" "iq_ge_gp"
## [13] "iq_ex_gp" "iq_cd_gp" "iq_aa_gp" "iq_ge_bs" "iq_ex_bs" "iq_cd_bs"
## [19] "iq_aa_bs"
Now we calculate the distances between the trees using KC, the default of treespace():
#Calculating the treespace
treespace_allspp_rtd <- treespace(trees_allspp_rtd, nf=18)
#head(treespace_allspp_rtd)
#head(treespace_allspp_rtd$pco$li)
We use the round() and sum() to verify the percentage explained by each axis in the ordination:
round(treespace_allspp_rtd$pco$eig/sum(treespace_allspp_rtd$pco$eig),3)
## [1] 0.972 0.015 0.010 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000
## [12] 0.000 0.000
#Axis1 = 97.2%, Axis2 = 1.5%, Axis3 = 1.0%
Then we find the clusters or trees using Kendall-Colijn metric:
id_cluster_rtd <- findGroves(treespace_allspp_rtd, method = "treeVec", nclust = 3)
#125
#y
Then we plot again to check:
plotGrovesD3(id_cluster_rtd, treeNames = trees_allspp_rtd_names)
Now we include the information about the software utilized in the phylogenetic inferrence:
methods_as_sv_iq <- read.csv("allsptrees.csv", h=F)
class(methods_as_sv_iq)
## [1] "data.frame"
#head(methods_as_sv_iq)
Note that the object above is a datafram, so “V1” is a vector and to gather its information we had to use “methods_as_sv_iq$V1” in the code below.
And then we prepared a dataframe with all the information that we will need to plot the data:
shorter_names <- c("ge", "ex", "cd", "aa", "ge", "ex", "cd", "ge_c", "ex_c", "cd_c", "aa_c", "ge_gp", "ex_gp", "cd_gp", "aa_gp", "ge_bs", "ex_bs", "cd_bs", "aa_bs")
df_rtd <- data.frame(id_cluster_rtd$treespace$pco$li, dataset=trees_allspp_rtd_names,
Method=methods_as_sv_iq$V1, group=id_cluster_rtd$groups,
name=rownames(id_cluster_rtd$treespace$pco$li), shorter_names=shorter_names)
#head(df_rtd)
df_rtd["coalesc_x_concat"]<- c("mcm", "mcm", "mcm", "mcm", "mcm", "mcm", "mcm", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree", "iqtree")
class(df_rtd$coalesc_x_concat)
## [1] "character"
#head(df_rtd)
df_rtd$Method <- factor(df_rtd$Method, levels = as.ordered(levels(df_rtd$Method)))
df_rtd_arranged <- arrange(df_rtd, desc(df_rtd$coalesc_x_concat))
#head(df_rtd_arranged)
#change order to avoid overlapping
df2_rtd_arranged <- arrange(df_rtd_arranged, desc(Method))
#head(df2_rtd_arranged)
df2_rtd_arranged$Method <- factor(df2_rtd_arranged$Method, levels(df2_rtd_arranged$Method)[c(1,3,2)])
plot2_KC <- ggplot(df2_rtd_arranged, aes(x=A1, y=A2, color=Method, size=4, label=shorter_names, shape=Method)) +
geom_point(alpha=1) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=1.5, point.padding = 0.8, show.legend=FALSE) +
labs(x="PCoA1 (97.2%)", y=" PCoA2 (1.5%)") +
theme_classic() + ggtitle("Kendall-Colijn") +
theme(legend.text=element_text(size=20),legend.title=element_text(size=20),
plot.title = element_text(size=20, hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 18,face="bold"), axis.text=element_text(size=14,face="bold")) +
scale_colour_manual(values = c("#000000", "#E69F00", "#56B4E9"))+
scale_shape_manual(values=c(17, 17, 19))
plot2_KC
Now we combine both graphs using plot_grid() First, create a legend for plot1_RF, then combine both graphs without a legend, and finalys adding a legend:
#Here we create an object that contains the legend that will be used in the grid to combine two graphs
legend <- get_legend(plot1_RF + theme(legend.position="bottom"))
two_plots <- plot_grid(plot1_RF + theme(legend.position="none"), plot2_KC +
theme(legend.position="none"),
align = 'vh',
labels = c("a)", "b)"),
hjust = -1, nrow = 2)
two_plots
#Now adding the legend
two_plots_onelegend <- plot_grid(two_plots, legend, rel_heights = c(6, .5), hjust = 1, nrow = 2)
#two_plots_onelegend
#Finally adding a title
title <- ggdraw() + draw_label("Species Trees Landscape", fontface='bold', size = 18)
sp_tree_plot <- plot_grid(title, two_plots_onelegend, ncol=1, rel_heights=c(.1, 1.5), rel_widths = .1)
sp_tree_plot
#Saving pdf determining the size of the figure:
save_plot("./Figure5.pdf", sp_tree_plot, base_width=6, base_height=12)
Now let’s look at species trees versus gene trees, using RF and KC metric for each dataset
#tree <- read.nexus("genetrees.nex")
#first let's confirm we dropped the tips of species that are not present in all trees and let's unroot the tree
#tree_gerf <- lapply(tree, drop.tip, c("Caryophyllales_Caryophyllaceae_Silene_capitata", "Myrtales_Myrtaceae_Xanthostemon_chrysanthus", "Fabales_Polygalaceae_Polygala_polygama"))
#class(tree)
#class(tree) <- "multiPhylo"
#class(tree)
#tree <- unroot(tree)
#write.tree(tree, file="genetrees_unrooted.nwk", d=0)
tree_gerf <- read.nexus("./input_unrooted/all_spgenetree_ge_rf.nex")
class(tree_gerf)
## [1] "multiPhylo"
tree_names_gerf <- names(tree_gerf)
tree_names_gerf
## [1] "astral" "svdq" "concatenated" "genepartition"
## [5] "bestscheme" "atpA" "atpB" "atpE"
## [9] "atpF" "atpH" "atpI" "cemA"
## [13] "matK" "ndhA" "ndhB" "ndhC"
## [17] "ndhD" "ndhE" "ndhF" "ndhG"
## [21] "ndhI" "ndhJ" "petB" "petD"
## [25] "petG" "petL" "petN" "psaA"
## [29] "psaB" "psaC" "psaI" "psaJ"
## [33] "psbA" "psbB" "psbC" "psbD"
## [37] "psbE" "psbF" "psbH" "psbI"
## [41] "psbJ" "psbK" "psbM" "psbN"
## [45] "psbT" "psbZ" "rbcL" "rpl14"
## [49] "rpl20" "rpl23" "rpl2" "rpl33"
## [53] "rpl36" "rpoB" "rpoC1" "rpoC2"
## [57] "rps11" "rps12" "rps14" "rps15"
## [61] "rps18" "rps2" "rps3" "rps4"
## [65] "rps7" "rps8" "ycf3"
is.rooted.multiPhylo(tree_allspp_nrtd)
## as_ge as_ex as_cd as_aa sv_ge sv_ex sv_cd iq_ge_c
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## iq_ex_c iq_cd_c iq_aa_c iq_ge_gp iq_ex_gp iq_cd_gp iq_aa_gp iq_ge_bs
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## iq_ex_bs iq_cd_bs iq_aa_bs
## FALSE FALSE FALSE
method_gerf <- read.csv("all_spgenetree.csv", h=F)
class(method_gerf)
## [1] "data.frame"
#Calculating the distances
treespace_gerf <- treespace(tree_gerf, method = "RF", nf=66)
#head(treespace_gerf)
#head(treespace_gerf$pco$li)
plotGrovesD3(treespace_gerf$pc, treeNames = tree_names_gerf)
#Detecting the number of groups
#id_cluster_gerf <- findGroves(treespace_gerf, method = "RF")
#250
#y
id_cluster_gerf <- findGroves(treespace_gerf, method = "RF", nclust = 2)
plotGrovesD3(id_cluster_gerf, treeNames = tree_names_gerf)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_gerf <- data.frame(id_cluster_gerf$treespace$pco$li, dataset=tree_names_gerf, Method=method_gerf$V1,
group=id_cluster_gerf$groups, name=rownames(id_cluster_gerf$treespace$pco$li))
#head(df_gerf)
#The names of all gene trees
#df_gerf["sptree_vs_getree_1"] <- c("as", "sv", "c", "gp", "bs", "atpA", "atpB", "atpE", "atpF", "atpH", "atpI", "cemA", "matK", "ndhA", "ndhB", "ndhC", "ndhD", "ndhE", "ndhF", "ndhG", "ndhI", "ndhJ", "petB", "petD", "petG", "petL", "petN", "psaA", "psaB", "psaC", "psaI", "psaJ", "psbA", "psbB", "psbC", "psbD", "psbE", "psbF", "psbH", "psbI", "psbJ ", "psbK ", "psbM", "psbN", "psbT", "psbZ", "rbcL", "rpl14", "rpl20", "rpl23", "rpl2", "rpl33", "rpl36", "rpoB", "rpoC1", "rpoC2", "rps11", "rps12", "rps14", "rps15", "rps18", "rps2", "rps3", "rps4", "rps7", "rps8", "ycf3" )
#head(df_gerf)
df_gerf["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_gerf)
#change order to avoid overlapping
df2_ge_arrangedrf <- arrange(df_gerf, desc(Method))
#head(df2_ge_arrangedrf)
#getting the name for the legend
#colnames(df2_ge_arrangedrf$Method)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_gerf$pco$eig/sum(treespace_gerf$pco$eig),3)
## [1] 0.157 0.049 0.042 0.041 0.040 0.039 0.037 0.036 0.035 0.033 0.031
## [12] 0.029 0.028 0.027 0.026 0.024 0.023 0.022 0.020 0.019 0.018 0.016
## [23] 0.015 0.014 0.014 0.012 0.011 0.011 0.011 0.009 0.009 0.008 0.008
## [34] 0.007 0.007 0.006 0.006 0.006 0.005 0.005 0.004 0.004 0.004 0.004
## [45] 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.001 0.001 0.001 0.001
## [56] 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.157, Axis2 = 0.049, Axis3 = 0.042, Axis4 = 0.041, Axis5 = 0.040
#Now we plot the graph
df2_ge_arrangedrf$Method <- factor(df2_ge_arrangedrf$Method, levels(df2_ge_arrangedrf$Method)[c(1,5,4,2,3,6)])
plot_RF_ge <- ggplot(df2_ge_arrangedrf, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
geom_point(aes(color=Method, shape=Method),alpha=.8, size=4) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
labs(x="PCoA1 (15.7%)", y=" PCoA2 (4.9%)") +
theme_classic() + ggtitle("Robinson-Foulds - gene") +
theme(legend.text=element_text(size=12),legend.title=element_text(size=15),
plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_RF_ge
## Warning: Removed 59 rows containing missing values (geom_text_repel).
tree_gekc <- read.nexus("./input_rooted/all_spgenetree_ge_kc.nex")
tree_gekc
## 67 phylogenetic trees
tree_gekc <- root(tree_gekc, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
class(tree_gekc)
## [1] "multiPhylo"
is.rooted.multiPhylo(tree_gekc)
## astral svdq iq_concatenated iq_genepartition
## TRUE TRUE TRUE TRUE
## iq_bestscheme atpA atpB atpE
## TRUE TRUE TRUE TRUE
## atpF atpH atpI cemA
## TRUE TRUE TRUE TRUE
## matK ndhA ndhB ndhC
## TRUE TRUE TRUE TRUE
## ndhD ndhE ndhF ndhG
## TRUE TRUE TRUE TRUE
## ndhI ndhJ petB petD
## TRUE TRUE TRUE TRUE
## petG petL petN psaA
## TRUE TRUE TRUE TRUE
## psaB psaC psaI psaJ
## TRUE TRUE TRUE TRUE
## psbA psbB psbC psbD
## TRUE TRUE TRUE TRUE
## psbE psbF psbH psbI
## TRUE TRUE TRUE TRUE
## psbJ psbK psbM psbN
## TRUE TRUE TRUE TRUE
## psbT psbZ rbcL rpl14
## TRUE TRUE TRUE TRUE
## rpl20 rpl23 rpl2 rpl33
## TRUE TRUE TRUE TRUE
## rpl36 rpoB rpoC1 rpoC2
## TRUE TRUE TRUE TRUE
## rps11 rps12 rps14 rps15
## TRUE TRUE TRUE TRUE
## rps18 rps2 rps3 rps4
## TRUE TRUE TRUE TRUE
## rps7 rps8 ycf3
## TRUE TRUE TRUE
tree_names_gekc <- names(tree_gekc)
tree_names_gekc
## [1] "astral" "svdq" "iq_concatenated"
## [4] "iq_genepartition" "iq_bestscheme" "atpA"
## [7] "atpB" "atpE" "atpF"
## [10] "atpH" "atpI" "cemA"
## [13] "matK" "ndhA" "ndhB"
## [16] "ndhC" "ndhD" "ndhE"
## [19] "ndhF" "ndhG" "ndhI"
## [22] "ndhJ" "petB" "petD"
## [25] "petG" "petL" "petN"
## [28] "psaA" "psaB" "psaC"
## [31] "psaI" "psaJ" "psbA"
## [34] "psbB" "psbC" "psbD"
## [37] "psbE" "psbF" "psbH"
## [40] "psbI" "psbJ" "psbK"
## [43] "psbM" "psbN" "psbT"
## [46] "psbZ" "rbcL" "rpl14"
## [49] "rpl20" "rpl23" "rpl2"
## [52] "rpl33" "rpl36" "rpoB"
## [55] "rpoC1" "rpoC2" "rps11"
## [58] "rps12" "rps14" "rps15"
## [61] "rps18" "rps2" "rps3"
## [64] "rps4" "rps7" "rps8"
## [67] "ycf3"
#This is to identify metadata
method_gekc <- read.csv("all_spgenetree.csv", h=F)
#Calculating the treespace
treespace_gekc <- treespace(tree_gekc, nf=66)
#head(treespace_gekc)
#head(treespace_gekc$pco$li)
plotGrovesD3(treespace_gekc$pc, treeNames = tree_names_gekc)
#Detecting the number of groups
#id_cluster_gekc <- findGroves(treespace_gekc)
#600
#y
id_cluster_gekc <- findGroves(treespace_gekc, nclust = 2)
plotGrovesD3(id_cluster_gekc)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_gekc <- data.frame(id_cluster_gekc$treespace$pco$li, dataset=tree_names_gekc, Method=method_gekc$V1,
group=id_cluster_gekc$groups, name=rownames(id_cluster_gekc$treespace$pco$li))
#head(df_gekc)
df_gekc["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_gekc)
#change order to avoid overlapping
df2_ge_arrangedkc <- arrange(df_gekc, desc(Method))
#head(df2_ge_arrangedkc)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_gekc$pco$eig/sum(treespace_gekc$pco$eig),3)
## [1] 0.515 0.082 0.060 0.048 0.032 0.027 0.025 0.023 0.021 0.017 0.015
## [12] 0.014 0.013 0.012 0.010 0.010 0.008 0.007 0.007 0.006 0.004 0.004
## [23] 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.001
## [34] 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000
## [45] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [56] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.515, Axis2 = 0.082, Axis3 = 0.060, Axis4 = 0.048, Axis5 = 0.032
#Now we plot the graph
df2_ge_arrangedkc$Method <- factor(df2_ge_arrangedkc$Method, levels(df2_ge_arrangedkc$Method)[c(1,2,3,6,5,4)])
plot_KC_ge <- ggplot(df2_ge_arrangedkc, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
geom_point(aes(color=Method, shape=Method), alpha=.8, size=4) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=2, point.padding = 0.9, show.legend=FALSE) +
labs(x="PCoA1 (51.5%)", y=" PCoA2 (8.2%)") +
theme_classic() + ggtitle("Kendall-Colijn - gene") +
theme(legend.text=element_text(size=12),legend.title=element_text(size=15),
plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_KC_ge
## Warning: Removed 59 rows containing missing values (geom_text_repel).
setwd("/Users/deisejpg/Documents/1_Plastid_genome/phyloanalysis/treespace/update_feb2018/")
getwd()
## [1] "/Users/deisejpg/Documents/1_Plastid_genome/phyloanalysis/treespace/update_feb2018"
#first let's confirm we dropped the tips of species that are not present in all trees and let's unroot the tree
#tree_aarf <- lapply(tree_aarf, drop.tip, c("Caryophyllales_Caryophyllaceae_Silene_capitata",
# "Myrtales_Myrtaceae_Xanthostemon_chrysanthus",
# "Fabales_Polygalaceae_Polygala_polygama"))
#class(tree_aarf)
#class(tree_aarf) <- "multiPhylo"
#class(tree_aarf)
#tree_aarf <- unroot(tree_aarf)
#write.tree(tree_aarf, file="genetrees_aa_unrooted.nwk", d=0)
tree_aarf <- read.nexus("./input_unrooted/all_spgenetree_aa_rf.nex")
class(tree_aarf)
## [1] "multiPhylo"
is.rooted.multiPhylo(tree_aarf)
## astral concatenated genepartition bestscheme atpA
## FALSE FALSE FALSE FALSE FALSE
## atpB atpE atpF atpH atpI
## FALSE FALSE FALSE FALSE FALSE
## cemA matK ndhA ndhB ndhC
## FALSE FALSE FALSE FALSE FALSE
## ndhD ndhE ndhF ndhG ndhI
## FALSE FALSE FALSE FALSE FALSE
## ndhJ petB petD petG petL
## FALSE FALSE FALSE FALSE FALSE
## petN psaA psaB psaC psaI
## FALSE FALSE FALSE FALSE FALSE
## psaJ psbA psbB psbC psbD
## FALSE FALSE FALSE FALSE FALSE
## psbE psbF psbH psbI psbJ
## FALSE FALSE FALSE FALSE FALSE
## psbK psbM psbN psbT psbZ
## FALSE FALSE FALSE FALSE FALSE
## rbcL rpl14 rpl20 rpl23 rpl2
## FALSE FALSE FALSE FALSE FALSE
## rpl33 rpl36 rpoB rpoC1 rpoC2
## FALSE FALSE FALSE FALSE FALSE
## rps11 rps12 rps14 rps15 rps18
## FALSE FALSE FALSE FALSE FALSE
## rps2 rps3 rps4 rps7 rps8
## FALSE FALSE FALSE FALSE FALSE
## ycf3
## FALSE
tree_names_aarf <- names(tree_aarf)
tree_names_aarf
## [1] "astral" "concatenated" "genepartition" "bestscheme"
## [5] "atpA" "atpB" "atpE" "atpF"
## [9] "atpH" "atpI" "cemA" "matK"
## [13] "ndhA" "ndhB" "ndhC" "ndhD"
## [17] "ndhE" "ndhF" "ndhG" "ndhI"
## [21] "ndhJ" "petB" "petD" "petG"
## [25] "petL" "petN" "psaA" "psaB"
## [29] "psaC" "psaI" "psaJ" "psbA"
## [33] "psbB" "psbC" "psbD" "psbE"
## [37] "psbF" "psbH" "psbI" "psbJ"
## [41] "psbK" "psbM" "psbN" "psbT"
## [45] "psbZ" "rbcL" "rpl14" "rpl20"
## [49] "rpl23" "rpl2" "rpl33" "rpl36"
## [53] "rpoB" "rpoC1" "rpoC2" "rps11"
## [57] "rps12" "rps14" "rps15" "rps18"
## [61] "rps2" "rps3" "rps4" "rps7"
## [65] "rps8" "ycf3"
#This is to identify metadata
method_aarf <- read.csv("all_spgenetree_aa.csv", h=F)
#Calculating the treespace
treespace_aarf <- treespace(tree_aarf, method = "RF", nf=65)
#head(treespace_aarf)
#head(treespace_aarf$pco$li)
plotGrovesD3(treespace_aarf$pc, treeNames = tree_names_aarf)
#Detecting the number of groups
#id_cluster_aarf <- findGroves(treespace_aarf, , method = RF)
#300
#y
id_cluster_aarf <- findGroves(treespace_aarf, , method = RF, nclust = 2)
plotGrovesD3(id_cluster_aarf)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_aarf <- data.frame(id_cluster_aarf$treespace$pco$li, dataset=tree_names_aarf, Method=method_aarf$V1, group=id_cluster_aarf$groups, name=rownames(id_cluster_aarf$treespace$pco$li))
#head(df_aarf)
df_aarf["sptree_vs_getree"] <- c("as", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_aarf)
#change order to avoid overlapping
df2_aa_arrangedrf <- arrange(df_aarf, desc(Method))
#head(df2_aa_arrangedrf)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_aarf$pco$eig/sum(treespace_aarf$pco$eig),3)
## [1] 0.168 0.029 0.027 0.026 0.025 0.024 0.024 0.024 0.023 0.023 0.023
## [12] 0.022 0.022 0.021 0.021 0.021 0.021 0.021 0.020 0.020 0.020 0.019
## [23] 0.019 0.018 0.018 0.017 0.016 0.016 0.016 0.015 0.014 0.014 0.013
## [34] 0.012 0.012 0.012 0.011 0.011 0.010 0.010 0.009 0.009 0.009 0.008
## [45] 0.008 0.007 0.006 0.006 0.006 0.005 0.005 0.005 0.004 0.003 0.003
## [56] 0.003 0.002 0.001 0.001 0.001 0.000 0.000 0.000 0.000
#Axis1 = 0.168, Axis2 = 0.029, Axis3=0.027, Axis4=0.026, Axis5=0.025
#Now we plot the graph
#the next line is not necessary here because we don't have svdq_aa
df2_aa_arrangedrf$Method <- factor(df2_aa_arrangedrf$Method, levels(df2_aa_arrangedrf$Method)[c(1,2,5,4,3)])
plot_RF_aa <- ggplot(df2_aa_arrangedrf, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
geom_point(aes(color=Method, shape=Method), alpha=.8, size=4) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
labs(x="PCoA1 (16.8%)", y=" PCoA2 (2.9%)") +
theme_classic() + ggtitle("Robinson-Foulds - amino acid") +
theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
scale_colour_manual(values = c("#000000", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
scale_shape_manual(values=c(17,19, 19, 19, 19))
plot_RF_aa
## Warning: Removed 59 rows containing missing values (geom_text_repel).
#KC (amino acid matrix)
tree_aa <- read.nexus("./input_rooted/all_spgenetree_aa_kc.nex")
tree_aa
## 66 phylogenetic trees
tree_aa <- root(tree_aa, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
is.rooted.multiPhylo(tree_aa)
## astral concatenated genepartition bestscheme atpA
## TRUE TRUE TRUE TRUE TRUE
## atpB atpE atpF atpH atpI
## TRUE TRUE TRUE TRUE TRUE
## cemA matK ndhA ndhB ndhC
## TRUE TRUE TRUE TRUE TRUE
## ndhD ndhE ndhF ndhG ndhI
## TRUE TRUE TRUE TRUE TRUE
## ndhJ petB petD petG petL
## TRUE TRUE TRUE TRUE TRUE
## petN psaA psaB psaC psaI
## TRUE TRUE TRUE TRUE TRUE
## psaJ psbA psbB psbC psbD
## TRUE TRUE TRUE TRUE TRUE
## psbE psbF psbH psbI psbJ
## TRUE TRUE TRUE TRUE TRUE
## psbK psbM psbN psbT psbZ
## TRUE TRUE TRUE TRUE TRUE
## rbcL rpl14 rpl20 rpl23 rpl2
## TRUE TRUE TRUE TRUE TRUE
## rpl33 rpl36 rpoB rpoC1 rpoC2
## TRUE TRUE TRUE TRUE TRUE
## rps11 rps12 rps14 rps15 rps18
## TRUE TRUE TRUE TRUE TRUE
## rps2 rps3 rps4 rps7 rps8
## TRUE TRUE TRUE TRUE TRUE
## ycf3
## TRUE
class(tree_aa)
## [1] "multiPhylo"
tree_names_aa <- names(tree_aa)
tree_names_aa
## [1] "astral" "concatenated" "genepartition" "bestscheme"
## [5] "atpA" "atpB" "atpE" "atpF"
## [9] "atpH" "atpI" "cemA" "matK"
## [13] "ndhA" "ndhB" "ndhC" "ndhD"
## [17] "ndhE" "ndhF" "ndhG" "ndhI"
## [21] "ndhJ" "petB" "petD" "petG"
## [25] "petL" "petN" "psaA" "psaB"
## [29] "psaC" "psaI" "psaJ" "psbA"
## [33] "psbB" "psbC" "psbD" "psbE"
## [37] "psbF" "psbH" "psbI" "psbJ"
## [41] "psbK" "psbM" "psbN" "psbT"
## [45] "psbZ" "rbcL" "rpl14" "rpl20"
## [49] "rpl23" "rpl2" "rpl33" "rpl36"
## [53] "rpoB" "rpoC1" "rpoC2" "rps11"
## [57] "rps12" "rps14" "rps15" "rps18"
## [61] "rps2" "rps3" "rps4" "rps7"
## [65] "rps8" "ycf3"
#This is to identify metadata
method_aa <- read.csv("all_spgenetree_aa.csv", h=F)
#Calculating the treespace
treespace_aa <- treespace(tree_aa, nf=65)
#head(treespace_aa)
#head(treespace_aa$pco$li)
plotGrovesD3(treespace_aa$pc, treeNames = tree_names_aa)
#Detecting the number of groups
#id_cluster_aa <- findGroves(treespace_aa)
#1000
#y
id_cluster_aa <- findGroves(treespace_aa, nclust = 3)
plotGrovesD3(id_cluster_aa, treeNames = tree_names_aa)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_aa <- data.frame(id_cluster_aa$treespace$pco$li, dataset=tree_names_aa, Method=method_aa$V1,
group=id_cluster_aa$groups, name=rownames(id_cluster_aa$treespace$pco$li))
#head(df_aa)
df_aa["sptree_vs_getree"] <- c("as", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_aa)
#change order to avoid overlapping
df2_aa_arranged <- arrange(df_aa, desc(Method))
#head(df2_aa_arranged)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_aa$pco$eig/sum(treespace_aa$pco$eig),3)
## [1] 0.264 0.106 0.083 0.061 0.042 0.040 0.038 0.032 0.031 0.026 0.024
## [12] 0.021 0.020 0.017 0.016 0.016 0.014 0.013 0.011 0.010 0.010 0.009
## [23] 0.008 0.007 0.007 0.006 0.006 0.005 0.005 0.004 0.004 0.004 0.004
## [34] 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.001 0.001
## [45] 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000
## [56] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.264; Axis2 = 0.106, Axis3 = 0.083, Axis4 = 0.061, Axis5 = 0.042
df2_aa_arranged$Method <- factor(df2_aa_arranged$Method, levels(df2_aa_arranged$Method)[c(1,2,5,4,3)])
#Now we plot the graph
plot_KC_aa <- ggplot(df2_aa_arranged, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
geom_point(aes(color=Method, shape=Method), alpha=.8, size=4) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
labs(x="PCoA1 (26.4%)", y=" PCoA2 (10.6%)") +
theme_classic() + ggtitle("Kendall-Colijn - amino acid") +
theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
scale_colour_manual(values = c("#000000", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
scale_shape_manual(values=c(17, 19, 19, 19, 19))
plot_KC_aa
## Warning: Removed 59 rows containing missing values (geom_text_repel).
tree_exrf <- read.nexus("./input_unrooted/all_spgenetree_ex_rf.nex")
is.rooted.multiPhylo(tree_exrf)
## astral svdq concatenated genepartition bestscheme
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_atpA iq_ex_atpB iq_ex_atpE iq_ex_atpF iq_ex_atpH
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_atpI iq_ex_cemA iq_ex_matK iq_ex_ndhA iq_ex_ndhB
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_ndhC iq_ex_ndhD iq_ex_ndhE iq_ex_ndhF iq_ex_ndhG
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_ndhI iq_ex_ndhJ iq_ex_petB iq_ex_petD iq_ex_petG
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_petL iq_ex_petN iq_ex_psaA iq_ex_psaB iq_ex_psaC
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_psaI iq_ex_psaJ iq_ex_psbA iq_ex_psbB iq_ex_psbC
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_psbD iq_ex_psbE iq_ex_psbF iq_ex_psbH iq_ex_psbI
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_psbJ iq_ex_psbK iq_ex_psbM iq_ex_psbN iq_ex_psbT
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_psbZ iq_ex_rbcL iq_ex_rpl14 iq_ex_rpl20 iq_ex_rpl23
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_rpl2 iq_ex_rpl33 iq_ex_rpl36 iq_ex_rpoB iq_ex_rpoC1
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_rpoC2 iq_ex_rps11 iq_ex_rps12 iq_ex_rps14 iq_ex_rps15
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_rps18 iq_ex_rps2 iq_ex_rps3 iq_ex_rps4 iq_ex_rps7
## FALSE FALSE FALSE FALSE FALSE
## iq_ex_rps8 iq_ex_ycf3
## FALSE FALSE
tree_names_exrf <- names(tree_exrf)
tree_names_exrf
## [1] "astral" "svdq" "concatenated" "genepartition"
## [5] "bestscheme" "iq_ex_atpA" "iq_ex_atpB" "iq_ex_atpE"
## [9] "iq_ex_atpF" "iq_ex_atpH" "iq_ex_atpI" "iq_ex_cemA"
## [13] "iq_ex_matK" "iq_ex_ndhA" "iq_ex_ndhB" "iq_ex_ndhC"
## [17] "iq_ex_ndhD" "iq_ex_ndhE" "iq_ex_ndhF" "iq_ex_ndhG"
## [21] "iq_ex_ndhI" "iq_ex_ndhJ" "iq_ex_petB" "iq_ex_petD"
## [25] "iq_ex_petG" "iq_ex_petL" "iq_ex_petN" "iq_ex_psaA"
## [29] "iq_ex_psaB" "iq_ex_psaC" "iq_ex_psaI" "iq_ex_psaJ"
## [33] "iq_ex_psbA" "iq_ex_psbB" "iq_ex_psbC" "iq_ex_psbD"
## [37] "iq_ex_psbE" "iq_ex_psbF" "iq_ex_psbH" "iq_ex_psbI"
## [41] "iq_ex_psbJ" "iq_ex_psbK" "iq_ex_psbM" "iq_ex_psbN"
## [45] "iq_ex_psbT" "iq_ex_psbZ" "iq_ex_rbcL" "iq_ex_rpl14"
## [49] "iq_ex_rpl20" "iq_ex_rpl23" "iq_ex_rpl2" "iq_ex_rpl33"
## [53] "iq_ex_rpl36" "iq_ex_rpoB" "iq_ex_rpoC1" "iq_ex_rpoC2"
## [57] "iq_ex_rps11" "iq_ex_rps12" "iq_ex_rps14" "iq_ex_rps15"
## [61] "iq_ex_rps18" "iq_ex_rps2" "iq_ex_rps3" "iq_ex_rps4"
## [65] "iq_ex_rps7" "iq_ex_rps8" "iq_ex_ycf3"
#This is to identify metadata
method_exrf <- read.csv("all_spgenetree.csv", h=F)
class(method_exrf)
## [1] "data.frame"
#Calculating the treespace
treespace_exrf <- treespace(tree_exrf, method = "RF", nf=66)
## Warning in is.euclid(distmat): Zero distance(s)
#head(treespace_exrf)
#head(treespace_exrf$pco$li)
plotGrovesD3(treespace_exrf$pc, treeNames = tree_names_exrf)
#Detecting the number of groups
#id_cluster_exrf <- findGroves(treespace_exrf, method = "RF")
#250
#y
id_cluster_exrf <- findGroves(treespace_exrf, method = "RF", nclust = 2)
plotGrovesD3(id_cluster_exrf, treeNames = tree_names_exrf)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_exrf <- data.frame(id_cluster_exrf$treespace$pco$li, dataset=tree_names_exrf, Method=method_exrf$V1,
group=id_cluster_exrf$groups, name=rownames(id_cluster_exrf$treespace$pco$li))
#head(df_exrf)
df_exrf["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_exrf)
#change order to avoid overlapping
df2_ex_arrangedrf <- arrange(df_exrf, desc(Method))
#head(df2_ex_arrangedrf)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_exrf$pco$eig/sum(treespace_exrf$pco$eig),3)
## [1] 0.147 0.050 0.041 0.040 0.039 0.037 0.036 0.035 0.031 0.031 0.030
## [12] 0.027 0.026 0.026 0.025 0.023 0.023 0.023 0.020 0.020 0.018 0.017
## [23] 0.017 0.016 0.014 0.014 0.013 0.011 0.011 0.010 0.009 0.009 0.009
## [34] 0.008 0.008 0.007 0.007 0.007 0.006 0.005 0.005 0.005 0.005 0.004
## [45] 0.004 0.004 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.002
## [56] 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000
#Axis1 = 0.147; Axis2 = 0.050, Axis3 = 0.041, Axis4 = 0.040, Axis5 = 0.039
#Now we plot the graph
df2_ex_arrangedrf$Method <- factor(df2_ex_arrangedrf$Method, levels(df2_ex_arrangedrf$Method)[c(1,4,2,5,6,3)])
plot_RF_ex <- ggplot(df2_ex_arrangedrf, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
geom_point(aes(color=Method, shape=Method), alpha=.6, size=4) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
labs(x="PCoA1 (14.7%)", y=" PCoA2 (5.0%)") +
theme_classic() + ggtitle("Robinson-Foulds - exon") +
theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_RF_ex
## Warning: Removed 59 rows containing missing values (geom_text_repel).
#KC (Exon matrix)
tree_ex <- read.nexus("input_rooted/all_spgenetree_ex_kc.nex")
is.rooted.multiPhylo(tree_ex)
## astral svdq concatenated genepartition bestscheme
## TRUE FALSE FALSE FALSE FALSE
## atpA atpB atpE atpF atpH
## TRUE TRUE TRUE TRUE TRUE
## atpI cemA matK ndhA ndhB
## TRUE TRUE TRUE TRUE TRUE
## ndhC ndhD ndhE ndhF ndhG
## TRUE TRUE TRUE TRUE TRUE
## ndhI ndhJ petB petD petG
## TRUE TRUE TRUE TRUE TRUE
## petL petN psaA psaB psaC
## TRUE TRUE TRUE TRUE TRUE
## psaI psaJ psbA psbB psbC
## TRUE TRUE TRUE TRUE TRUE
## psbD psbE psbF psbH psbI
## TRUE TRUE TRUE TRUE TRUE
## psbJ psbK psbM psbN psbT
## TRUE TRUE TRUE TRUE TRUE
## psbZ rbcL rpl14 rpl20 rpl23
## TRUE TRUE TRUE TRUE TRUE
## rpl2 rpl33 rpl36 rpoB rpoC1
## TRUE TRUE TRUE TRUE TRUE
## rpoC2 rps11 rps12 rps14 rps15
## TRUE TRUE TRUE TRUE TRUE
## rps18 rps2 rps3 rps4 rps7
## TRUE TRUE TRUE TRUE TRUE
## rps8 ycf3
## TRUE TRUE
tree_ex <- root(tree_ex, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
is.rooted.multiPhylo(tree_ex)
## astral svdq concatenated genepartition bestscheme
## TRUE TRUE TRUE TRUE TRUE
## atpA atpB atpE atpF atpH
## TRUE TRUE TRUE TRUE TRUE
## atpI cemA matK ndhA ndhB
## TRUE TRUE TRUE TRUE TRUE
## ndhC ndhD ndhE ndhF ndhG
## TRUE TRUE TRUE TRUE TRUE
## ndhI ndhJ petB petD petG
## TRUE TRUE TRUE TRUE TRUE
## petL petN psaA psaB psaC
## TRUE TRUE TRUE TRUE TRUE
## psaI psaJ psbA psbB psbC
## TRUE TRUE TRUE TRUE TRUE
## psbD psbE psbF psbH psbI
## TRUE TRUE TRUE TRUE TRUE
## psbJ psbK psbM psbN psbT
## TRUE TRUE TRUE TRUE TRUE
## psbZ rbcL rpl14 rpl20 rpl23
## TRUE TRUE TRUE TRUE TRUE
## rpl2 rpl33 rpl36 rpoB rpoC1
## TRUE TRUE TRUE TRUE TRUE
## rpoC2 rps11 rps12 rps14 rps15
## TRUE TRUE TRUE TRUE TRUE
## rps18 rps2 rps3 rps4 rps7
## TRUE TRUE TRUE TRUE TRUE
## rps8 ycf3
## TRUE TRUE
tree_names_ex <- names(tree_ex)
tree_names_ex
## [1] "astral" "svdq" "concatenated" "genepartition"
## [5] "bestscheme" "atpA" "atpB" "atpE"
## [9] "atpF" "atpH" "atpI" "cemA"
## [13] "matK" "ndhA" "ndhB" "ndhC"
## [17] "ndhD" "ndhE" "ndhF" "ndhG"
## [21] "ndhI" "ndhJ" "petB" "petD"
## [25] "petG" "petL" "petN" "psaA"
## [29] "psaB" "psaC" "psaI" "psaJ"
## [33] "psbA" "psbB" "psbC" "psbD"
## [37] "psbE" "psbF" "psbH" "psbI"
## [41] "psbJ" "psbK" "psbM" "psbN"
## [45] "psbT" "psbZ" "rbcL" "rpl14"
## [49] "rpl20" "rpl23" "rpl2" "rpl33"
## [53] "rpl36" "rpoB" "rpoC1" "rpoC2"
## [57] "rps11" "rps12" "rps14" "rps15"
## [61] "rps18" "rps2" "rps3" "rps4"
## [65] "rps7" "rps8" "ycf3"
#This is to identify metadata
method_ex <- read.csv("all_spgenetree.csv", h=F)
#Calculating the treespace
treespace_ex <- treespace(tree_ex, nf=66)
#head(treespace_ex$pco$li)
plotGrovesD3(treespace_ex$pc, treeNames = tree_names_ex)
#Detecting the number of groups
#id_cluster_ex <- findGroves(treespace_ex)
#580
#y
id_cluster_ex <- findGroves(treespace_ex, nclust = 3)
plotGrovesD3(id_cluster_ex, treeNames = tree_names_ex)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_ex <- data.frame(id_cluster_ex$treespace$pco$li, dataset=tree_names_ex, Method=method_ex$V1,
group=id_cluster_ex$groups, name=rownames(id_cluster_ex$treespace$pco$li))
#head(df_ex)
df_ex["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_ex)
#change order to avoid overlapping
df2_ex_arranged <- arrange(df_ex, desc(Method))
#head(df2_ex_arranged)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_ex$pco$eig/sum(treespace_ex$pco$eig),3)
## [1] 0.521 0.086 0.061 0.047 0.031 0.027 0.022 0.021 0.019 0.017 0.015
## [12] 0.013 0.012 0.010 0.010 0.009 0.008 0.007 0.006 0.006 0.005 0.004
## [23] 0.004 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002
## [34] 0.002 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## [45] 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [56] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.521 Axis2 = 0.086, Axis3 = 0.061, Axis4 = 0.047, Axis5 = 0.031
#Now we plot the graph
df2_ex_arranged$Method <- factor(df2_ex_arranged$Method, levels(df2_ex_arranged$Method)[c(1,3,6,2,4,5)])
plot_KC_ex <- ggplot(df2_ex_arranged, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
geom_point(aes(color=Method, shape=Method), alpha=.6, size=4) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=2, point.padding=0.8, show.legend=FALSE) +
labs(x="PCoA1 (52.1%)", y=" PCoA2 (8.6%)") +
theme_classic() + ggtitle("Kendall-Colijn - exon") +
theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold"))+
scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_KC_ex
## Warning: Removed 59 rows containing missing values (geom_text_repel).
#RF (codon matrix)
tree_cdrf <- read.nexus("./input_unrooted/all_spgenetree_cd_rf.nex")
is.rooted.multiPhylo(tree_cdrf)
## astral svdq concatenated genepartition bestscheme
## FALSE FALSE FALSE FALSE FALSE
## atpA atpB atpE atpF atpH
## FALSE FALSE FALSE FALSE FALSE
## atpI cemA matK ndhA ndhB
## FALSE FALSE FALSE FALSE FALSE
## ndhC ndhD ndhE ndhF ndhG
## FALSE FALSE FALSE FALSE FALSE
## ndhI ndhJ petB petD petG
## FALSE FALSE FALSE FALSE FALSE
## petL petN psaA psaB psaC
## FALSE FALSE FALSE FALSE FALSE
## psaI psaJ psbA psbB psbC
## FALSE FALSE FALSE FALSE FALSE
## psbD psbE psbF psbH psbI
## FALSE FALSE FALSE FALSE FALSE
## psbJ psbK psbM psbN psbT
## FALSE FALSE FALSE FALSE FALSE
## psbZ rbcL rpl14 rpl20 rpl23
## FALSE FALSE FALSE FALSE FALSE
## rpl2 rpl33 rpl36 rpoB rpoC1
## FALSE FALSE FALSE FALSE FALSE
## rpoC2 rps11 rps12 rps14 rps15
## FALSE FALSE FALSE FALSE FALSE
## rps18 rps2 rps3 rps4 rps7
## FALSE FALSE FALSE FALSE FALSE
## rps8 ycf3
## FALSE FALSE
#tree_ex <- unroot(tree_cdrf)
tree_names_cdrf <- names(tree_cdrf)
tree_names_cdrf
## [1] "astral" "svdq" "concatenated" "genepartition"
## [5] "bestscheme" "atpA" "atpB" "atpE"
## [9] "atpF" "atpH" "atpI" "cemA"
## [13] "matK" "ndhA" "ndhB" "ndhC"
## [17] "ndhD" "ndhE" "ndhF" "ndhG"
## [21] "ndhI" "ndhJ" "petB" "petD"
## [25] "petG" "petL" "petN" "psaA"
## [29] "psaB" "psaC" "psaI" "psaJ"
## [33] "psbA" "psbB" "psbC" "psbD"
## [37] "psbE" "psbF" "psbH" "psbI"
## [41] "psbJ" "psbK" "psbM" "psbN"
## [45] "psbT" "psbZ" "rbcL" "rpl14"
## [49] "rpl20" "rpl23" "rpl2" "rpl33"
## [53] "rpl36" "rpoB" "rpoC1" "rpoC2"
## [57] "rps11" "rps12" "rps14" "rps15"
## [61] "rps18" "rps2" "rps3" "rps4"
## [65] "rps7" "rps8" "ycf3"
#This is to identify metadata
method_cdrf <- read.csv("all_spgenetree.csv", h=F)
#Calculating the treespace
treespace_cdrf <- treespace(tree_cdrf, method = "RF", nf=66)
## Warning in is.euclid(distmat): Zero distance(s)
#head(treespace_cdrf)
#head(treespace_cdrf$pco$li)
plotGrovesD3(treespace_cdrf$pc, treeNames = tree_names_cdrf)
#Detecting the number of groups
#id_cluster_cdrf <- findGroves(treespace_cdrf, method = "RF")
#200
#y
id_cluster_cdrf <- findGroves(treespace_cdrf, method = "RF", nclust = 2)
plotGrovesD3(id_cluster_cdrf, treeNames = tree_names_cdrf)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_cdrf <- data.frame(id_cluster_cdrf$treespace$pco$li, dataset=tree_names_cdrf, Method=method_cdrf$V1,
group=id_cluster_cdrf$groups, name=rownames(id_cluster_cdrf$treespace$pco$li))
#head(df_cdrf)
df_cdrf["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_cdrf)
#change order to avoid overlapping
df2_cd_arrangedrf <- arrange(df_cdrf, desc(Method))
#head(df2_cd_arrangedrf)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_cdrf$pco$eig/sum(treespace_cdrf$pco$eig),3)
## [1] 0.147 0.050 0.041 0.040 0.039 0.037 0.036 0.035 0.032 0.031 0.030
## [12] 0.027 0.026 0.026 0.024 0.023 0.023 0.023 0.020 0.020 0.018 0.017
## [23] 0.017 0.016 0.014 0.014 0.013 0.011 0.011 0.010 0.009 0.009 0.009
## [34] 0.008 0.008 0.008 0.007 0.007 0.006 0.005 0.005 0.005 0.005 0.004
## [45] 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.002
## [56] 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000
#Axis1 = 0.147, Axis2 = 0.050, Axis3 = 0.041, Axis4 = 0.040, Axis5 = 0.039
#Now we plot the graph
df2_cd_arrangedrf$Method <- factor(df2_cd_arrangedrf$Method, levels(df2_cd_arrangedrf$Method)[c(1,2,3,6,5,4)])
plot_RF_cd <- ggplot(df2_cd_arrangedrf, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
geom_point(aes(color=Method, shape=Method), alpha=.6, size=4) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
labs(x="PCoA1 (14.7%)", y=" PCoA2 (0.05%)") +
theme_classic() + ggtitle("Robinson-Foulds - codon") +
theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_RF_cd
## Warning: Removed 59 rows containing missing values (geom_text_repel).
#KC (codon matrix)
tree_cd <- read.nexus("./input_rooted/all_spgenetree_cd_kc.nex")
tree_cd
## 67 phylogenetic trees
is.rooted.multiPhylo(tree_cd)
## astral svdq concatenated genepartition bestscheme
## TRUE FALSE FALSE FALSE FALSE
## atpA atpB atpE atpF atpH
## TRUE TRUE TRUE TRUE TRUE
## atpI cemA matK ndhA ndhB
## TRUE TRUE TRUE TRUE TRUE
## ndhC ndhD ndhE ndhF ndhG
## TRUE TRUE TRUE TRUE TRUE
## ndhI ndhJ petB petD petG
## TRUE TRUE TRUE TRUE TRUE
## petL petN psaA psaB psaC
## TRUE TRUE TRUE TRUE TRUE
## psaI psaJ psbA psbB psbC
## TRUE TRUE TRUE TRUE TRUE
## psbD psbE psbF psbH psbI
## TRUE TRUE TRUE TRUE TRUE
## psbJ psbK psbM psbN psbT
## TRUE TRUE TRUE TRUE TRUE
## psbZ rbcL rpl14 rpl20 rpl23
## TRUE TRUE TRUE TRUE TRUE
## rpl2 rpl33 rpl36 rpoB rpoC1
## TRUE TRUE TRUE TRUE TRUE
## rpoC2 rps11 rps12 rps14 rps15
## TRUE TRUE TRUE TRUE TRUE
## rps18 rps2 rps3 rps4 rps7
## TRUE TRUE TRUE TRUE TRUE
## rps8 ycf3
## TRUE TRUE
tree_cd <- root(tree_cd, "Apiales_Araliaceae_Aralia_undulata", resolve.root = TRUE)
is.rooted.multiPhylo(tree_cd)
## astral svdq concatenated genepartition bestscheme
## TRUE TRUE TRUE TRUE TRUE
## atpA atpB atpE atpF atpH
## TRUE TRUE TRUE TRUE TRUE
## atpI cemA matK ndhA ndhB
## TRUE TRUE TRUE TRUE TRUE
## ndhC ndhD ndhE ndhF ndhG
## TRUE TRUE TRUE TRUE TRUE
## ndhI ndhJ petB petD petG
## TRUE TRUE TRUE TRUE TRUE
## petL petN psaA psaB psaC
## TRUE TRUE TRUE TRUE TRUE
## psaI psaJ psbA psbB psbC
## TRUE TRUE TRUE TRUE TRUE
## psbD psbE psbF psbH psbI
## TRUE TRUE TRUE TRUE TRUE
## psbJ psbK psbM psbN psbT
## TRUE TRUE TRUE TRUE TRUE
## psbZ rbcL rpl14 rpl20 rpl23
## TRUE TRUE TRUE TRUE TRUE
## rpl2 rpl33 rpl36 rpoB rpoC1
## TRUE TRUE TRUE TRUE TRUE
## rpoC2 rps11 rps12 rps14 rps15
## TRUE TRUE TRUE TRUE TRUE
## rps18 rps2 rps3 rps4 rps7
## TRUE TRUE TRUE TRUE TRUE
## rps8 ycf3
## TRUE TRUE
tree_names_cd <- names(tree_cd)
tree_names_cd
## [1] "astral" "svdq" "concatenated" "genepartition"
## [5] "bestscheme" "atpA" "atpB" "atpE"
## [9] "atpF" "atpH" "atpI" "cemA"
## [13] "matK" "ndhA" "ndhB" "ndhC"
## [17] "ndhD" "ndhE" "ndhF" "ndhG"
## [21] "ndhI" "ndhJ" "petB" "petD"
## [25] "petG" "petL" "petN" "psaA"
## [29] "psaB" "psaC" "psaI" "psaJ"
## [33] "psbA" "psbB" "psbC" "psbD"
## [37] "psbE" "psbF" "psbH" "psbI"
## [41] "psbJ" "psbK" "psbM" "psbN"
## [45] "psbT" "psbZ" "rbcL" "rpl14"
## [49] "rpl20" "rpl23" "rpl2" "rpl33"
## [53] "rpl36" "rpoB" "rpoC1" "rpoC2"
## [57] "rps11" "rps12" "rps14" "rps15"
## [61] "rps18" "rps2" "rps3" "rps4"
## [65] "rps7" "rps8" "ycf3"
#This is to identify metadata
method_cd <- read.csv("all_spgenetree.csv", h=F)
#Calculating the treespace
treespace_cd <- treespace(tree_cd, nf=66)
#head(treespace_cd)
#head(treespace_cd$pco$li)
plotGrovesD3(treespace_cd$pc, treeNames = tree_names_cd)
#Detecting the number of groups
#id_cluster_cd <- findGroves(treespace_cd)
#580
#y
id_cluster_cd <- findGroves(treespace_cd, nclust = 3)
plotGrovesD3(id_cluster_cd, treeNames = tree_names_cd)
#making a dataframe with the coordinates of the trees calculated by treespace(), cluster information, and rownmes
#for names of trees; group is not necessary because I have already calculated the number of groups with findGroves()
df_cd <- data.frame(id_cluster_cd$treespace$pco$li, dataset=tree_names_cd, Methods=method_cd$V1,
group=id_cluster_cd$groups, name=rownames(id_cluster_cd$treespace$pco$li))
#head(df_cd)
df_cd["sptree_vs_getree"] <- c("as", "sv", "c", "gp", "bs", NA, NA, NA, NA, NA, NA, NA, "matK", NA, NA, NA, NA, NA, "ndhF", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "rbcL", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
#head(df_cd)
#change order to avoid overlapping
df2_cd_arranged <- arrange(df_cd, desc(Methods))
#head(df2_cd_arranged)
#We use the round() and sum() to verify the percentage explained by each axis in the ordination
round(treespace_cd$pco$eig/sum(treespace_cd$pco$eig),3)
## [1] 0.521 0.086 0.061 0.047 0.031 0.027 0.022 0.021 0.019 0.017 0.015
## [12] 0.013 0.012 0.010 0.010 0.009 0.008 0.007 0.006 0.006 0.005 0.004
## [23] 0.004 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002
## [34] 0.002 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
## [45] 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [56] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
#Axis1 = 0.521, Axis2 = 0.086, Axis3 = 0.061, Axis4 = 0.047, Axis5 = 0.031
#Now we plot the graph
df2_cd_arranged$Method <- factor(df2_cd_arranged$Method, levels(df2_cd_arranged$Method)[c(1,3,6,5,4,2)])
plot_KC_cd <- ggplot(df2_cd_arranged, aes(x=A1, y=A2, color = Method, label = sptree_vs_getree)) +
geom_point(aes(color=Method, shape=Method), alpha=.6, size=4) +
guides(size=F) + geom_hline(yintercept = 0, linetype="dashed") +
geom_vline(xintercept = 0, linetype="dashed") +
geom_text_repel(size=6, box.padding=2, point.padding = 0.8, show.legend=FALSE) +
labs(x="PCoA1 (52.1%)", y=" PCoA2 (8.6%)") +
theme_classic() + ggtitle("Kendall-Colijn - codon") +
theme(legend.text=element_text(size=12),legend.title=element_text(size=15), plot.title = element_text(hjust = 0.5)) +
guides(colour = guide_legend(override.aes = list(size=5))) +
theme(axis.title=element_text(size = 12,face="bold"), axis.text=element_text(size=10,face="bold")) +
scale_colour_manual(values = c("#000000", "#E69F00", "#CC79A7", "#009E73", "#F0E442", "#56B4E9"))+
scale_shape_manual(values=c(17, 17, 19, 19, 19, 19))
plot_KC_cd
## Warning: Removed 59 rows containing missing values (geom_text_repel).
Now we prepared the final figures:
#GENE and AMINO ACIDS
#Now we combine both graphs using plot_grid()
#First combining both graphs without a legend
legend3 <- get_legend(plot_RF_ge + theme(legend.position="bottom"))
## Warning: Removed 59 rows containing missing values (geom_text_repel).
ge_aa_plots <- plot_grid(plot_RF_ge + theme(legend.position="none"),
plot_KC_ge + theme(legend.position="none"),
plot_RF_aa + theme(legend.position="none"),
plot_KC_aa + theme(legend.position="none"),
align = 'vh', labels = c("a)", "b)", "c)", "d)"), hjust = -1, nrow = 2)
## Warning: Removed 59 rows containing missing values (geom_text_repel).
## Warning: Removed 59 rows containing missing values (geom_text_repel).
## Warning: Removed 59 rows containing missing values (geom_text_repel).
## Warning: Removed 59 rows containing missing values (geom_text_repel).
ge_aa_plots
#Now adding the legend
ge_aa_plots_onelegend <- plot_grid(ge_aa_plots, legend3, ncol=1, rel_heights = c(3, .5))
ge_aa_plots_onelegend
#Finally adding a title
title <- ggdraw() + draw_label("Gene Trees Landscape", fontface='bold')
ge_aa_plot <- plot_grid(title, ge_aa_plots_onelegend, ncol=1, rel_heights = c(0.1, 1))
#Saving pdf determining the size of the figure:
save_plot("./Figure6.pdf", ge_aa_plot, base_width=9, base_height=11)
#EXON and CODON (SI)
#Now we combine both graphs using plot_grid()
#First combining both graphs without a legend
legend4 <- get_legend(plot_RF_ex + theme(legend.position="bottom"))
## Warning: Removed 59 rows containing missing values (geom_text_repel).
ex_cd_plots <- plot_grid(plot_RF_ex + theme(legend.position="none"),
plot_KC_ex + theme(legend.position="none"),
plot_RF_cd + theme(legend.position="none"),
plot_KC_cd + theme(legend.position="none"),
align = 'vh', labels = c("a)", "b)", "c)", "d)"), hjust = -1, nrow = 2)
## Warning: Removed 59 rows containing missing values (geom_text_repel).
## Warning: Removed 59 rows containing missing values (geom_text_repel).
## Warning: Removed 59 rows containing missing values (geom_text_repel).
## Warning: Removed 59 rows containing missing values (geom_text_repel).
ex_cd_plots
#Now adding the legend
ex_cd_plots_onelegend <- plot_grid(ex_cd_plots, legend4, ncol=1, rel_heights = c(3, .5))
ex_cd_plots_onelegend
#Finally adding a title
title <- ggdraw() + draw_label("Gene Tree Landscape", fontface='bold')
ex_cd_plot <- plot_grid(title, ex_cd_plots_onelegend, ncol=1, rel_heights=c(0.1, 1))
#Saving pdf determining the size of the figure:
save_plot("./SupFigureS20.pdf", ex_cd_plot, base_width=9, base_height=11)